1. What are the aims of using unsupervised learning (clustering)?
Clustering can be described as ‘the art of finding groups in data’. Classifying similar objects into groups is an important and natural human activity, and a prominent part of science, history, government, education and marketing. In the past, clustering and categorising have usually been quite subjective, and down to the judgment of the researcher.
Larger and more complex data sets have driven the rise of automatic categorisation procedures, which is where clustering techniques fit into the world of data science. The main reasons for using this kind of technique are to:
- identify distinct groups within a data set
- extend exploratory data analysis
- gain insight into the data and how the variables relate to one another
- improve supervised/predictive analysis
In this project, all of the above are relevant.
Since its founding as a Roman city nearly 2,000 years ago, London’s growth has been haphazard compared with the more planned and gradual expansion of other cities. Even so, when we break the boroughs up by Output Area (OA), can we see distinct clusters of neighbourhoods, such as Victorian terraces, postwar tower blocks, suburban cul-de-sacs and modern blocks of flats?
Given that OAs are purely administrative inventions, can clusters like this emerge at all, or is the data too noisy?
The aim of this part of the analysis is to run, explain and measure the performance of building clusters on our data set. By this, I mean I will come up with what I think
Summary of data being clustered
At the moment, this data covers only Lewisham Council. I will be running the clustering on building data, which has been split into categories for type and age.
Type: Flat block | Converted Flats | House (detached/semi-detached) | Terraced house
Age: Victorian/pre-WW1 | Interwar | Postwar (1945-1979) | Modern (1980-)
For each OA, the percentage of addresses that fall into each of the above categories is calculated. The clusters are based on these variables. What we are looking for here is whether or not OAs fit into neat categories of building types.
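As a rough illustration of the percentage calculation described above, here is a minimal sketch using only the Python standard library. The OA codes and address records are invented for the example, and `addresses`, `TYPES` and `type_pct` are hypothetical names, not part of the project code. The same approach would apply to the age categories.

```python
from collections import Counter, defaultdict

# Hypothetical address-level records as (OA code, building type) pairs.
# These rows are invented for illustration only.
addresses = [
    ("OA1", "Terraced house"), ("OA1", "Terraced house"),
    ("OA1", "Flat block"), ("OA1", "Converted Flats"),
    ("OA2", "Flat block"), ("OA2", "Flat block"),
]

# Fixed column order, taken from the type categories listed above.
TYPES = ["Flat block", "Converted Flats",
         "House (detached/semi-detached)", "Terraced house"]

# Count addresses per OA and building type.
counts = defaultdict(Counter)
for oa, building_type in addresses:
    counts[oa][building_type] += 1

# One row of percentages per OA; a missing category counts as 0%.
type_pct = {
    oa: [100 * c[t] / sum(c.values()) for t in TYPES]
    for oa, c in counts.items()
}

for oa, row in sorted(type_pct.items()):
    print(oa, row)
```

Each OA ends up as one row of numbers that sum to 100, which is the form the clustering step needs.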
This analysis is carried out using k-means clustering: we specify the number of categories (k), and the algorithm then finds that number of groups in the data. This is called unsupervised learning, since we do not define in advance what the categories are.
Our first step, therefore, is to decide how many categories there are. A common way of doing this is the elbow method, which compares the within-cluster sum of squares for different numbers of clusters and looks for the point where the improvement levels off. See the next slide.
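To make the k-means step and the elbow check more concrete, here is a minimal, dependency-free sketch. It is not the code used for this analysis: in practice a library implementation would normally be used. The data is synthetic (three invented groups of OAs described by two percentages), and the farthest-first starting points are an assumption chosen only so the sketch gives the same answer every run.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]

def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm; returns (centroids, labels, inertia)."""
    # Farthest-first start (an assumption made for reproducibility):
    # begin with the first point, then repeatedly add the point that is
    # farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    for _ in range(n_iter):
        labels = assign(points, centroids)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Move the centroid to the mean of its members; keep it if empty.
            updated.append(
                tuple(sum(col) / len(members) for col in zip(*members))
                if members else centroids[j])
        if updated == centroids:
            break  # assignments can no longer change
        centroids = updated
    labels = assign(points, centroids)
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, inertia

# Synthetic stand-in for OA features (% terraced, % flat block): three
# invented groups with a little noise. This is NOT the Lewisham data.
rng = random.Random(42)
centres = [(80, 10), (15, 75), (45, 45)]
points = [(cx + rng.uniform(-3, 3), cy + rng.uniform(-3, 3))
          for cx, cy in centres for _ in range(20)]

# Elbow method: inertia (within-cluster sum of squares) for k = 1..6.
inertias = {k: kmeans(points, k)[2] for k in range(1, 7)}
for k, value in inertias.items():
    print(k, round(value, 1))
```

On data like this, the inertia falls steeply up to k = 3 and changes little afterwards, and that bend is the "elbow". On noisier data the bend is often much less obvious, which is part of what we are testing here.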
2. Main findings and conclusions
The clustering here is not as well defined as we thought it might be. We think there are two main reasons for this:
- Streets in London developed gradually over time, and sometimes in a less than orderly manner, especially compared with places like Paris.
- OAs are quite arbitrary boundaries, drawn up for administrative purposes rather than to reflect any kind of neighbourhood.
The next stage is to see whether this clustering works any better for streets and other combinations of variables.
4. Details about each of the clusters
[1] 0.0000000 -0.4335308 1.0537517 -0.5010226 0.0000000 -0.6921756
In the above example, the resulting five clusters are unevenly distributed.
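One simple way to put a number on how uneven the clusters are is to count the OAs in each one. The sketch below uses made-up labels, not the project's results, and `labels`, `sizes` and `imbalance` are hypothetical names.

```python
from collections import Counter

# Made-up cluster labels for 12 OAs; these are NOT the project's results.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2, 3, 4]

# Number of OAs assigned to each cluster.
sizes = Counter(labels)
total = len(labels)
for cluster, n in sorted(sizes.items()):
    print(f"cluster {cluster}: {n} OAs ({100 * n / total:.0f}%)")

# Ratio of largest to smallest cluster as a crude imbalance measure.
imbalance = max(sizes.values()) / min(sizes.values())
print("largest / smallest:", imbalance)
```

A ratio close to 1 would mean evenly sized clusters, while a large ratio points to one or two dominant clusters and several very small ones.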